The code for the herein described process can also be freely downloaded from https://github.com/fzenoni/london-accidents.
Besides being able to “put dots on a map”, R can be used in very effective ways to do spatial analytics. Last month, Stefano already provided descriptive statistics on a Kaggle dataset that includes 16 years of UK accidents.
It is now time to move up a gear (no pun intended), and tackle one of the many ways of extracting spatial insights from the data. In this post, we are going to ignore the potential analysis brought by the time series. It will certainly be the topic of another post.
Instead we will try to propose an answer to the following questions: what if the government wanted to highlight the areas of a city showing some alarming characteristics, with a given statistical significance? What if we wanted to discover what are London’s most dangerous roads or intersections for car drivers?
In fact, London happens too large for our pruposes. I will only investigate a neighborhood, but the method shown in the following stays valid at any scale.
First things first, we load the data and clean it a bit. The fastest way to do it in-memory, while enjoying the functions devoted to tables, is still to use the data.table package.
# Load data
set <- list.files(path = '1-6m-accidents-traffic-flow-over-16-years',
pattern = 'accidents.*csv', full.names = TRUE)
data <- lapply(set, fread) %>% rbindlist
# Filter out empty locations
data <- na.omit(data, cols = c('Longitude', 'Latitude'))
# Remove duplicates
data <- data[!duplicated(data)]
data.table is nice and all, but since we work with spatial data we’re going to use the sf format, as we did in the past. As sf does not exactly extend data.table I’m going to cast the table to a data.frame first.
data <- data %>% as.data.frame %>% st_as_sf(coords = c('Longitude', 'Latitude'), crs = 4326)
This data include accidents over 16 years for the whole of UK. It represents a lot of records, and performance wise, I don’t necessarily have a strategy in place to analyze them all. As a first move, I’m going to select the events that fall inside the boroughs of London administrative boundaries. The Kaggle dataset luckily includes the geoJSON of UK’s districts. I’ve had mixed feelings in the past concerning this format, but the bad ones were wiped out by the recent discovery of two methods to open and import it as sf.
The first one is the classic sf::read_sf().
system.time(sf::read_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
## user system elapsed
## 7.069 0.592 8.074
But the freshest discovery is the geojsonsf::geojson_sf() function from SymbolixAU (check their blog post post here), that serves exactly our purpose, in an even faster way.
system.time(map <- geojsonsf::geojson_sf('1-6m-accidents-traffic-flow-over-16-years/Local_Authority_Districts_Dec_2016.geojson'))
## user system elapsed
## 1.819 0.117 2.615
head(map)
## Simple feature collection with 6 features and 10 fields
## geometry type: GEOMETRY
## dimension: XY
## bbox: xmin: -2.832457 ymin: 53.30503 xmax: -0.7884304 ymax: 54.72717
## epsg (SRID): 4326
## proj4string: +proj=longlat +datum=WGS84 +no_defs
## bng_e bng_n geometry lad16cd
## 1 447157 531476 MULTIPOLYGON (((-1.268456 5... E06000001
## 2 451141 516887 MULTIPOLYGON (((-1.243904 5... E06000002
## 3 464359 519597 MULTIPOLYGON (((-1.137578 5... E06000003
## 4 444937 518183 MULTIPOLYGON (((-1.317286 5... E06000004
## 5 428029 515649 POLYGON ((-1.637678 54.6171... E06000005
## 6 354246 382146 MULTIPOLYGON (((-2.626835 5... E06000006
## lad16nm lad16nmw lat long objectid st_areashape
## 1 Hartlepool 54.67616 -1.27023 1 93559511
## 2 Middlesbrough 54.54467 -1.21099 2 53888581
## 3 Redcar and Cleveland 54.56752 -1.00611 3 244820281
## 4 Stockton-on-Tees 54.55691 -1.30669 4 204962233
## 5 Darlington 54.53535 -1.56835 5 197475689
## 6 Halton 53.33424 -2.68853 6 79084033
## st_lengthshape
## 1 71707.33
## 2 43840.85
## 3 97993.29
## 4 119581.51
## 5 107206.28
## 6 77770.94
Now I would like to extract London’s borough, but I am bored by having to inspect the map, and the need to learn new geography. Since I am the laziest member of the team, I decided to harvest this Wikipedia page with the rvest package, and use the list to filter the regions of interest.
# Harvest the list of London boroughs from Wikipedia
wiki_london <- xml2::read_html('https://en.wikipedia.org/wiki/London_boroughs')
boroughs1 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[1]] %>% rvest::html_text()
boroughs2 <- wiki_london %>% rvest::html_nodes('ol') %>% .[[2]] %>% rvest::html_text()
list1 <- as.list(strsplit(boroughs1, "\n")) %>% unlist
list2 <- as.list(strsplit(boroughs2, "\n")) %>% unlist
list <- c(list1, list2)
# Special cases to fix
list <- replace(list, list=='City of London (not a London borough)', 'City of London')
list <- replace(list, list=='City of Westminster', 'Westminster')
# Filter map
london <- map %>% dplyr::filter(lad16nm %in% list)
# Unite the boroughs
london_union <- london %>% st_union
The London map is ready to be displayed.
plot(london_union)
As mentioned before, originally I wanted to analyze all of London’s data, but it seems that to do within today I would need to parallelize part of the analysis code (I have some plans, so maybe it will be a task for a post addendum). Instead, I will analyze the data falling into a radius of 2000 m from London’s centroid. For sf aficionados, this is a trivial task.
# Project to British National Grid (http://epsg.io/27700)
data <- data %>% st_transform(crs = 27700)
london_union <- london_union %>% st_transform(crs = 27700)
# Build a circle
center <- st_centroid(london_union)
max_radius <- 2000
london_circle <- st_buffer(center, max_radius)
# Filter data thanks to map
london_data <- data[london_circle,]
This is what we got. I know, I know, it’s a small sample!
# Original amount of data
nrow(data)
## [1] 1469894
# Filtered data
nrow(london_data)
## [1] 9205
# Draw the area
plot(london_union)
sf.colors(alpha = 0.4)
## [1] "#0000B366" "#0400FF66" "#4500FF66" "#8500FF66" "#C527D866"
## [6] "#FF50AF66" "#FF7A8566" "#FFA35C66" "#FFCC3366" "#FFF50A66"
plot(london_circle, add = T, col = sf.colors(n = 1, alpha = 0.3))
# Display the accidents as points
mapview(london_data)